Anonlink Entity Service API¶
This tutorial demonstrates interacting with the entity service via the REST API. The primary alternative is to use a library or command line tool such as clkhash (http://clkhash.readthedocs.io/), which can handle the communication with the anonlink entity service for you.
Dependencies¶
In this tutorial we interact with the REST API using the requests Python library. We also use the clkhash Python library to define the linkage schema and to encode the PII. The synthetic dataset comes from the recordlinkage package. All the dependencies can be installed with pip:
pip install requests clkhash recordlinkage
Steps¶
- Check connection to Anonlink Entity Service
- Synthetic Data generation and encoding
- Create a new linkage project
- Upload the encodings
- Create a run
- Retrieve and analyse results
[1]:
import json
import os
import time
import requests
from IPython.display import clear_output
Check Connection¶
If you are connecting to a custom entity service, change the address here.
[2]:
server = os.getenv("SERVER", "https://testing.es.data61.xyz")
url = server + "/api/v1/"
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz/api/v1/
[3]:
requests.get(url + 'status').json()
[3]:
{'project_count': 2278, 'rate': 3863861, 'status': 'ok'}
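The status endpoint reports the number of projects hosted and the current comparison rate, alongside an overall status flag. A small guard like the following (a sketch; the key names match the response shown above) lets a script fail fast if the service is unavailable:

```python
def assert_service_ok(status):
    """Raise if the parsed JSON from GET /api/v1/status does not report 'ok'."""
    if status.get('status') != 'ok':
        raise RuntimeError(f"entity service unavailable: {status}")
    return status

# Usage against a live service:
# assert_service_ok(requests.get(url + 'status').json())
```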
Data preparation¶
This section won’t be explained in great detail as it directly follows the clkhash tutorials.
We encode a synthetic dataset from the recordlinkage library using clkhash.
[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()
[6]:
with open('a.csv', 'w') as a_csv:
    dfA.to_csv(a_csv, line_terminator='\n')
with open('b.csv', 'w') as b_csv:
    dfB.to_csv(b_csv, line_terminator='\n')
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column when encoding PII into CLKs. A detailed description of the hashing schema can be found in the clkhash documentation.
A linkage schema can be defined either as Python code, as shown here, or as a JSON file (shown in other tutorials). The importance of each field is controlled by the k parameter in its FieldHashingProperties. We ignore the record id and social security id fields, so they won't be incorporated into the encoding.
[7]:
import clkhash
from clkhash.field_formats import *
schema = clkhash.randomnames.NameList.SCHEMA
_missing = MissingValueSpec(sentinel='')
schema.fields = [
    Ignore('rec_id'),
    StringSpec('given_name',
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('surname',
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('street_number',
                FieldHashingProperties(ngram=1,
                                       positional=True,
                                       k=15,
                                       missing_value=_missing)),
    StringSpec('address_1',
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('address_2',
               FieldHashingProperties(ngram=2, k=15)),
    StringSpec('suburb',
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('postcode',
                FieldHashingProperties(ngram=1, positional=True, k=15)),
    StringSpec('state',
               FieldHashingProperties(ngram=2, k=15)),
    IntegerSpec('date_of_birth',
                FieldHashingProperties(ngram=1, positional=True, k=15,
                                       missing_value=_missing)),
    Ignore('soc_sec_id')
]
Encoding¶
Here we transform the raw personally identifiable information (PII) into CLK encodings following the defined schema. See the clkhash documentation for further details on this.
[8]:
from clkhash import clk
with open('a.csv') as a_pii:
    hashed_data_a = clk.generate_clk_from_csv(a_pii, ('key1',), schema, validate=False)
with open('b.csv') as b_pii:
    hashed_data_b = clk.generate_clk_from_csv(b_pii, ('key1',), schema, validate=False)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.78kclk/s, mean=645, std=43.8]
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.35kclk/s, mean=634, std=50.3]
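The mean and std reported by the progress bars refer to the popcounts of the encodings, i.e. the number of set bits in each Bloom filter. Since each CLK is a base64-encoded bit array, the popcount can be recomputed directly; this is a sketch assuming the base64 string encodings produced above:

```python
import base64

def popcount(clk_b64: str) -> int:
    """Number of set bits in a base64-encoded CLK (Bloom filter)."""
    return sum(bin(byte).count('1') for byte in base64.b64decode(clk_b64))

# Usage with the encodings generated above:
# counts = [popcount(clk) for clk in hashed_data_a]
# print(sum(counts) / len(counts))  # should be close to the reported mean
```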
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[9]:
project_spec = {
    "schema": {},
    "result_type": "mapping",
    "number_parties": 2,
    "name": "API Tutorial Test"
}
credentials = requests.post(url + 'projects', json=project_spec).json()
project_id = credentials['project_id']
a_token, b_token = credentials['update_tokens']
credentials
[9]:
{'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
'result_token': '693c423c0c021f92a9f7b1658ef8f19beaa7b9c1b27ea22c',
'update_tokens': ['57401d6c0edfa78abf3bd4a87936159f8c974f93dc352d21',
'8c44139db950ca88f58f18d18e219f001fa105543a7b25e6']}
Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.
The result_token can also be used to carry out project API requests:
[10]:
requests.get(url + 'projects/{}'.format(project_id),
             headers={"Authorization": credentials['result_token']}).json()
[10]:
{'error': False,
'name': 'API Tutorial Test',
'notes': '',
'number_parties': 2,
'parties_contributed': 0,
'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
'result_type': 'mapping',
'schema': {}}
Now the two clients can upload their data, providing the appropriate upload tokens.
CLK Upload¶
[12]:
a_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_a},
    headers={"Authorization": a_token}
).json()
[13]:
b_response = requests.post(
    '{}projects/{}/clks'.format(url, project_id),
    json={'clks': hashed_data_b},
    headers={"Authorization": b_token}
).json()
Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
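The receipt token is returned in the upload response (under the receipt_token key in this version of the service). Since some result types require it later, it is worth collecting defensively; a small sketch:

```python
def receipt_tokens(*upload_responses):
    """Collect receipt tokens from upload responses, failing loudly
    if any upload did not return one."""
    tokens = []
    for resp in upload_responses:
        token = resp.get('receipt_token')
        if token is None:
            raise ValueError(f"upload response carried no receipt token: {resp}")
        tokens.append(token)
    return tokens

# Usage with the uploads above:
# a_receipt, b_receipt = receipt_tokens(a_response, b_response)
```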
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy-preserving record linkage. Try a few different threshold values:
[21]:
run_response = requests.post(
    "{}projects/{}/runs".format(url, project_id),
    headers={"Authorization": credentials['result_token']},
    json={
        'threshold': 0.80,
        'name': "Tutorial Run #1"
    }
).json()
[22]:
run_id = run_response['run_id']
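To compare thresholds, each threshold simply becomes its own run within the same project. A sketch of such a sweep, where the create_run callable stands in for the requests.post call above (so the loop itself can be exercised without a live service):

```python
def sweep_thresholds(create_run, thresholds):
    """Create one run per threshold and return {threshold: run_id}.

    `create_run(threshold)` should POST to the project's runs endpoint
    and return the parsed JSON response.
    """
    return {t: create_run(t)['run_id'] for t in thresholds}

# Usage against the live service (same endpoint as the cell above):
# run_ids = sweep_thresholds(
#     lambda t: requests.post(
#         "{}projects/{}/runs".format(url, project_id),
#         headers={"Authorization": credentials['result_token']},
#         json={'threshold': t, 'name': "sweep {}".format(t)}).json(),
#     [0.7, 0.8, 0.9])
```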
Run Status¶
[23]:
requests.get(
'{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
headers={"Authorization": credentials['result_token']}
).json()
[23]:
{'current_stage': {'description': 'compute similarity scores',
'number': 2,
'progress': {'absolute': 25000000,
'description': 'number of already computed similarity scores',
'relative': 1.0}},
'stages': 3,
'state': 'running',
'time_added': '2019-04-30T12:18:44.633541+00:00',
'time_started': '2019-04-30T12:18:44.778142+00:00'}
Now, after some delay (depending on the size of the project), we can fetch the results. This can of course be done by polling the REST API directly using requests; however, for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.
Note that server is passed here rather than url.
[24]:
import clkhash.rest_client
for update in clkhash.rest_client.watch_run_status(server, project_id, run_id,
                                                   credentials['result_token'],
                                                   timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
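Directly polling the REST API, as mentioned above, might look like the sketch below. The fetch_status callable stands in for the requests.get status call shown earlier, which keeps the loop testable without a live service; the 'error' state is an assumption for failed runs, while 'running' and 'completed' appear in the responses above.

```python
import time

def poll_until_complete(fetch_status, interval=1.0, timeout=300.0):
    """Poll `fetch_status()` until the run completes, errors, or times out.

    `fetch_status` returns the parsed JSON of the run status endpoint,
    i.e. a dict with a 'state' key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get('state') == 'completed':
            return status
        if status.get('state') == 'error':
            raise RuntimeError(f"run failed: {status}")
        time.sleep(interval)
    raise TimeoutError("run did not complete in time")

# Usage against the live service:
# final = poll_until_complete(lambda: requests.get(
#     '{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
#     headers={"Authorization": credentials['result_token']}).json())
```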
[25]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    server,
    project_id,
    run_id,
    credentials['result_token']))
This result is the one-to-one mapping between rows that were more similar than the given threshold.
[30]:
for i in range(10):
    print("a[{}] maps to b[{}]".format(i, data['mapping'][str(i)]))
print("...")
a[0] maps to b[1449]
a[1] maps to b[2750]
a[2] maps to b[4656]
a[3] maps to b[4119]
a[4] maps to b[3306]
a[5] maps to b[2305]
a[6] maps to b[3944]
a[7] maps to b[992]
a[8] maps to b[4612]
a[9] maps to b[3629]
...
In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:
[31]:
len(data['mapping'])
[31]:
4853
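Since the febrl4 dataset contains exactly one true match per record, the fraction of records matched gives a rough recall figure, rough because it does not verify that each returned pair is the correct one. A minimal calculation:

```python
def approximate_recall(n_matched, n_true_matches):
    """Fraction of true matching pairs for which some match was returned.

    Note: this counts any returned pair as a success, so it is an upper
    bound on true recall unless each pair is verified individually.
    """
    return n_matched / n_true_matches

# With the run above, 4853 of the 5000 known matches were linked:
# approximate_recall(4853, 5000)  # roughly 0.97
```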
Cleanup¶
If you wish, you can delete the run and project from the anonlink-entity-service.
[44]:
requests.delete(
    "{}projects/{}".format(url, project_id),
    headers={"Authorization": credentials['result_token']})
[44]:
<Response [403]>